Conversation
Loop runner was discarding its partial-success results array on the
error return path: any per-iteration runner failure caused the step to
return a non-nil error, which the response builder treats as "drop
OutputData", so the SDK saw data: null instead of [null, {...}].
Distinguish infrastructure failures (queue submit errors, iteration
timeouts) from per-iteration runner failures. Only the former fail the
loop step; the latter leave results[i] = nil as a partial-success
placeholder and the loop returns success with the array preserved.
Fixes #511.
…ssor (#510) When on_chain_nonce < userOp.Nonce and the bundler returned AA25, the conflict handler always assumed a predecessor UserOp would eventually mine. If the predecessor had been dropped from the bundler mempool, the cache stayed stale, the rebuild loop kept reusing the same nonce, and every subsequent paymaster-sponsored UserOp for that wallet failed until the aggregator was restarted. Now inspect the bundler mempool: if no pending UserOp exists in [on_chain, userOp.Nonce, reset the nonce cache and rewind to the on-chain nonce so the rebuild self-heals. EOF )
Code ReviewOverall this is a well-motivated pair of fixes with clear comments and good separation of infra vs. per-iteration failures. A few items worth addressing before merge. 1. No test coverage for the nonce desync recovery path (
|
| # | Severity | File | Issue |
|---|---|---|---|
| 1 | Medium | builder.go |
No test for nonce desync recovery (3 branches uncovered) |
| 2 | Medium | builder.go:1169 |
Unbounded context.Background() for mempool inspection |
| 3 | Low | builder.go:1179 |
Silent skip on unparseable nonce, no debug log |
| 4 | Low | builder.go:1172 |
Decimal-looking nonces misinterpreted as hex |
| 5 | Low | vm.go:4853 |
success is always true at final finalizeStep; misleading |
| 6 | Low | manual_trigger_loop_test.go |
Missing step.Error == "" assertion |
The logic is sound — items 1 and 2 are the ones I'd prioritize before shipping.
There was a problem hiding this comment.
Pull request overview
This PR addresses two reliability issues: (1) recovering from ERC-4337 paymaster nonce-cache desyncs that previously caused repeated rebuild loops, and (2) preserving Loop node partial-success output when individual iterations fail, instead of failing the entire Loop step.
Changes:
- Add bundler-mempool inspection during AA25 nonce-conflict handling to detect missing predecessor UserOps and rewind the nonce cache to the on-chain nonce (#510 recovery path).
- Update
executeLoopWithQueueto only hard-fail the Loop step on infrastructure failures (queue submit errors / timeouts), while keeping per-iteration runner failures asnilentries in the result array. - Update Loop test expectations to match the new “partial success preserved” semantics; document Conventional Commits requirement in
CLAUDE.md.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/erc4337/preset/builder.go | Detects nonce-cache desync via bundler mempool and rewinds to on-chain nonce to avoid stuck rebuild loops. |
| core/taskengine/vm.go | Changes Loop failure semantics to preserve partial results unless infrastructure fails. |
| core/taskengine/manual_trigger_loop_test.go | Updates test assertions to reflect partial-success Loop behavior. |
| CLAUDE.md | Adds repo guidance requiring Conventional Commits PR titles for semantic-release. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // UserOp exists for this sender at a nonce in [on_chain, userOp.Nonce), | ||
| // then case (b) holds — rewind the cache to the on-chain nonce. | ||
| predecessorPending := false | ||
| if pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(context.Background(), entrypoint, userOp.Sender); mempoolErr == nil { |
There was a problem hiding this comment.
GetPendingUserOpsForSender is called with context.Background() inside the retry loop. If the bundler RPC hangs (or debug methods are slow), this can stall sendUserOpCore indefinitely and block user requests. Consider using a context with a short timeout (or a request-scoped context passed into sendUserOpCore) for mempool inspection so retries remain bounded.
| if pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(context.Background(), entrypoint, userOp.Sender); mempoolErr == nil { | |
| mempoolCtx, cancelMempool := context.WithTimeout(context.Background(), 3*time.Second) | |
| pendingOps, mempoolErr := bundlerClient.GetPendingUserOpsForSender(mempoolCtx, entrypoint, userOp.Sender) | |
| cancelMempool() | |
| if mempoolErr == nil { |
| } else { | ||
| l.Warn("Failed to inspect bundler mempool while diagnosing nonce conflict; assuming predecessor pending", | ||
| "error", mempoolErr) | ||
| predecessorPending = true | ||
| } |
There was a problem hiding this comment.
If mempool inspection fails (mempoolErr != nil), the code assumes a predecessor is pending and falls back to the old “wait for prior UserOps” behavior. This can reintroduce the stuck loop from #510 when the bundler doesn’t expose debug_bundler_dumpMempool (or it’s temporarily unavailable). Consider a safer fallback (e.g., after N retries, rewind to onChainNonce, or at least log that recovery is disabled without debug RPC).
| // On-chain nonce is behind our attempted nonce. Two possibilities: | ||
| // (a) Prior UserOps at lower nonces are genuinely pending in the bundler | ||
| // mempool and just haven't mined yet → wait for them. | ||
| // (b) The cache is stale: a previous UserOp was dropped from the bundler | ||
| // mempool without the cache being invalidated, so there is nothing | ||
| // at on_chain..userOp.Nonce-1 to ever mine. Waiting is futile and | ||
| // leads to the stuck loop described in issue #510. | ||
| // Distinguish the two by inspecting the bundler mempool. If no predecessor | ||
| // UserOp exists for this sender at a nonce in [on_chain, userOp.Nonce), | ||
| // then case (b) holds — rewind the cache to the on-chain nonce. |
There was a problem hiding this comment.
This new desync-recovery branch (detecting no predecessor UserOp in the bundler mempool and rewinding the nonce cache) isn’t covered by existing tests in the preset package. Given how failure-prone nonce-handling is, please add a focused test that exercises both cases: (1) predecessor pending → wait, (2) predecessor absent → ResetNonce + use onChainNonce for rebuild (ideally with a stubbed bundler client / extracted helper).
| // Per AvaProtocol/EigenLayer-AVS#511, per-iteration runner failures do not | ||
| // fail the loop step. The loop ran to completion, so step.Success is true | ||
| // and the per-iteration outcomes are reflected in the data array (nil for | ||
| // failed iterations). | ||
| require.NotNil(t, step, "Execution step should not be nil") | ||
| assert.False(t, step.Success, "Loop should report failure when an iteration fails") | ||
| assert.Contains(t, step.Error, "invalid address", "Error should mention invalid address") | ||
| assert.True(t, step.Success, "Loop step should succeed when it ran to completion, even if an iteration failed") | ||
|
|
There was a problem hiding this comment.
The test no longer asserts that the per-iteration failure is observable anywhere (e.g., in step.Log or nested iteration step metadata). Since the loop now reports Success=true on per-iteration runner errors, it’s important to still validate that the failure details are surfaced for debugging/clients (for example: step.Log contains the invalid address error and/or the failed iteration step is marked unsuccessful).
- Bound mempool inspection RPC with a 5s timeout so a hung bundler cannot stall the retry loop. - Drop misleading decimal-fallback nonce parser; bundlers return hex per EIP-4337. Log skipped unparseable nonces at debug level. - Assert step.Error is empty in the Loop partial-success test so the no-error contract is locked in.
Summary
Test plan
🤖 Generated with Claude Code